Abstract: We analyse the prequential plug-in codes relative to one-parameter exponential families $\mathcal{M}$. We show that if data are sampled i.i.d. from some distribution outside $\mathcal{M}$, then the redundancy of any plug-in prequential code grows at a rate larger than $\frac{1}{2}\ln n$ in the worst case. This means that plug-in codes, such as the Rissanen-Dawid ML code, may be inferior to other important universal codes such as the 2-part MDL, Shtarkov and Bayes codes, for which the redundancy is always $\frac{1}{2}\ln n + O(1)$. However, we also show that a slight modification of the ML plug-in code, "almost" in the model, does achieve the optimal redundancy even if the true distribution is outside $\mathcal{M}$.

I. Introduction 

We resolve two open problems from [1] concerning universal codes of the predictive plug-in type, also known as "prequential" codes. These codes were introduced independently by Rissanen [2] in the context of MDL learning and by Dawid [3], who proposed them as probability forecasting strategies rather than directly as codes. Roughly, the plug-in codes relative to a parametric model $\mathcal{M} = \{M_\theta \mid \theta \in \Theta\}$ work by sequentially coding each outcome $x_i$ based on an estimator $\hat\theta_i = \hat\theta(x^{i-1})$ for all previous outcomes $x^{i-1} = x_1, \ldots, x_{i-1}$, leading to codelength (log loss) $-\ln M_{\hat\theta_i}(x_i)$, where $M_\theta$ denotes the probability density or mass function indexed by $\theta$. If we take $\hat\theta_i$ equal to the ML (maximum likelihood) estimator, we call the resulting code the "ML plug-in code".

There are many papers about the redundancy and/or expected regret for the ML plug-in codes, for a large variety of models including multivariate exponential families, ARMA processes, regression models and so on. Examples are [4], [5], [6]. In all these papers the ML plug-in code is shown to achieve an asymptotic expected regret or redundancy of $\frac{k}{2}\ln n + O(1)$, where $k$ is the number of parameters of the model and $n$ is the sample size. This matches the behaviour of the Shtarkov, Bayesian and two-part universal codes and is optimal in several ways, see [7]; since the ML plug-in codes are often easier to calculate than any of these other three codes, this appears to be a strong argument for using them in practical data compression and MDL-style model selection. Yet, more recently [8], [9], [10], it was shown that, at least for single-parameter exponential family models, when the data are generated i.i.d. $\sim P$, the redundancy in fact grows as $\frac{1}{2}\ln n \cdot \mathrm{var}_P X / \mathrm{var}_M X$, where $M$ is the distribution in $\mathcal{M}$ that is closest to $P$ in Kullback-Leibler divergence, i.e. it minimizes $D(P\|M)$; a related result for linear regression is in [11]. In contrast to the other cited works, [8], [9], [10], [11] do not assume that $P \in \mathcal{M}$: the model may be misspecified. Yet if $P \in \mathcal{M}$, then we have $M = P$, so that the redundancy grows as it does for the other universal models. But when $M \neq P$, the Shtarkov, Bayes and two-part universal codes typically still achieve asymptotic expected regret $\frac{k}{2}\ln n$, whereas the plug-in codes behave differently. [8], [10] show that this leads to substantially inferior performance of the plug-in codes in practical MDL model selection.

A. The Two Open Problems/Conjectures 

In general, the estimator for $\mathcal{M}$ based on $x^{i-1}$ need not be an element of the parametric model $\mathcal{M}$; for example, we may think of the Bayesian predictive distribution as an estimator relative to $\mathcal{M}$, even though it is "out-model": rather than a single element of $\mathcal{M}$, it is a mixture of distributions in $\mathcal{M}$, each weighted by their posterior density (see Section IV for an example). We may thus re-interpret Bayesian universal codes as prequential codes based on "out-model" estimators. From now on, we reserve the term "prequential plug-in code", abbreviated to just "plug-in code", for codes based on "in-model" estimators, i.e. estimators required to lie within $\mathcal{M}$. When we call a code just "prequential", it may be sequentially constructed from either in-model or out-model estimators. [9] established a nonstandard redundancy, different from $(k/2)\ln n$, only for ML and closely related plug-in codes. [1, Open Problem Nr. 2] conjectured that a similar result should hold for all plug-in codes, even if they are based on in-model estimators very different from the ML estimator: the conjecture was that no plug-in code can achieve guaranteed redundancy of $(k/2)\ln n$ if data are i.i.d. $\sim P$ and $P \neq M$. Our first main result, Theorem 1 below, shows that, essentially, this conjecture is true for general one-parameter exponential families ($k = 1$). Specifically, the redundancy can become much larger than $(1/2)\ln n$ if $P \notin \mathcal{M}$.

The second related conjecture [1, Open Problem Nr. 3] concerned the fact that for the normal location family with constant variance $\sigma^2$, the Bayesian predictive distribution based on data $x^{i-1}$ and a normal prior looks "almost" like an in-model estimator for $x^{i-1}$, and hence the resulting code looks "almost" like a plug-in code: the Bayes predictive distribution is equal to the normal distribution for $X_i$ with mean equal to the ML estimator $\hat\mu(x^{i-1})$ but with a variance of order $\sigma^2 + O(1/n)$, i.e. slightly larger than the variance $\sigma^2$ of $M_{\hat\mu(x^{i-1})}$ (see Section IV for details). Since the Bayesian predictive distribution does achieve the redundancy $(1/2)\ln n$ even if $P \notin \mathcal{M}$, this means that if $\mathcal{M}$ is the normal location family, then there does exist an "almost" in-model estimator (i.e. a slight modification of the ML estimator) that does achieve $(1/2)\ln n$ even if $P \notin \mathcal{M}$. Although this example does not extend straightforwardly to other exponential families, [1] conjectured that there should nevertheless be some general definition for "almost" in-model estimators that achieve $(k/2)\ln n$ redundancy even if $P \notin \mathcal{M}$. Here we show that this conjecture is true, at least if $k = 1$: we propose the slightly squashed ML estimator, a modification of the ML estimator that puts it slightly outside the model $\mathcal{M}$, and in Theorem 2 we show that this estimator achieves $(1/2)\ln n$ redundancy even if $P \notin \mathcal{M}$. This result is important in practice since, in contrast to the Bayesian predictive distribution, the slightly squashed ML estimator is in general just as easy to compute as the ML estimator itself.

II. Notation and Definitions 

Throughout this text we use nats rather than bits as units of information. A sequence of outcomes $z_1, \ldots, z_n$ is abbreviated to $z^n$. We write $E_P$ as a shorthand for $E_{Z \sim P}$, the expectation of $Z$ under distribution $P$. When we consider a sequence of $n$ outcomes independently distributed $\sim P$, we use $E_P$ even as a shorthand for the expectation of $(Z_1, \ldots, Z_n)$ under the $n$-fold product distribution of $P$. Finally, $P(Z)$ denotes the probability mass function of $P$ in case $Z$ is discrete-valued, and it denotes the density of $P$ in case $Z$ takes its value in a continuum. When we write 'density function of $Z$', then, if $Z$ is discrete-valued, this should be read as 'probability mass function of $Z$'. Note however that in our second main result, Theorem 2, we do not assume that the data-generating distribution $P$ admits a density.

Let $\mathcal{Z}$ be a set of outcomes, taking values either in a finite or countable set, or in a subset of $k$-dimensional Euclidean space for some $k \geq 1$. Let $X : \mathcal{Z} \to \mathbb{R}$ be a random variable on $\mathcal{Z}$, and let $\mathcal{X} = \{x \in \mathbb{R} : \exists z \in \mathcal{Z} : X(z) = x\}$ be the range of $X$. Exponential family models are families of distributions on $\mathcal{Z}$ defined relative to a random variable $X$ (called 'sufficient statistic') as defined above, and a function $h : \mathcal{Z} \to [0, \infty)$. Let $Z(\eta) := \int_{z \in \mathcal{Z}} e^{-\eta X(z)} h(z)\,dz$ (the integral to be replaced by a sum for countable $\mathcal{Z}$), and $\Theta_{\text{nat}} := \{\eta \in \mathbb{R} : Z(\eta) < \infty\}$.

Definition 1 (Exponential family): The single parameter exponential family [12] with sufficient statistic $X$ and carrier $h$ is the family of distributions with densities

$$M_\eta(z) := \frac{1}{Z(\eta)} e^{-\eta X(z)} h(z),$$

where $\eta \in \Theta_{\text{nat}}$. $\Theta_{\text{nat}}$ is called the natural parameter space. The family is called regular if $\Theta_{\text{nat}}$ is an open interval of $\mathbb{R}$.

In the remainder of this text we only consider single 
parameter, regular exponential families, but this qualification 
will henceforth be omitted. Examples include the Poisson, 
geometric and multinomial families, and the model of all 
Gaussian distributions with a fixed variance or mean. 

The statistic $X(z)$ is sufficient for $\eta$ [12]. This suggests reparameterizing the distribution by the expected value of $X$, which is called the mean value parameterization. The function $\mu(\eta) = E_{M_\eta}[X]$ maps parameters in the natural parameterization to the mean value parameterization. It is a diffeomorphism (it is one-to-one, onto, infinitely often differentiable and has an infinitely often differentiable inverse) [12]. Therefore the mean value parameter space $\Theta_{\text{mean}}$ is also an open interval of $\mathbb{R}$. We write $\mathcal{M} = \{M_\mu \mid \mu \in \Theta_{\text{mean}}\}$, where $M_\mu$ is the distribution with mean value parameter $\mu$.
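As a concrete illustration of Definition 1 and the map $\mu(\eta)$, the following minimal sketch treats the Poisson family (one of the examples named above) with $X(z) = z$ and carrier $h(z) = 1/z!$, using the sign convention $e^{-\eta X(z)}$ of Definition 1. The function names and the truncation of the infinite sum at `terms` are assumptions made for this example only.

```python
import math

def partition(eta, terms=150):
    """Z(eta) = sum_z exp(-eta * z) / z!, truncated after `terms` summands."""
    total, term = 0.0, 1.0  # term holds exp(-eta * z) / z!, starting at z = 0
    for z in range(terms):
        total += term
        term *= math.exp(-eta) / (z + 1)
    return total

def density(z, eta, z_eta):
    """M_eta(z) = exp(-eta * z) * h(z) / Z(eta), with h(z) = 1/z!."""
    return math.exp(-eta * z - math.lgamma(z + 1)) / z_eta

def mean_value(eta, terms=150):
    """mu(eta) = E_{M_eta}[X], computed by direct (truncated) summation."""
    z_eta = partition(eta, terms)
    return sum(z * density(z, eta, z_eta) for z in range(terms))
```

Under this convention the Poisson partition function is $Z(\eta) = \exp(e^{-\eta})$ and the mean value map is $\mu(\eta) = e^{-\eta}$, which is smooth and strictly monotone, in line with the diffeomorphism property stated above.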

We are now ready to define the plug-in universal model. This is a distribution on infinite sequences $Z_1, Z_2, \ldots \in \mathcal{Z}^\infty$, recursively defined in terms of the distributions of $Z_{n+1}$ conditioned on $Z^n = z^n$, for all $n = 1, 2, \ldots$. In the definition, we use the notation $x_i := X(z_i)$. Note that we use the term "model" both for a single distribution ("plug-in universal model", a common phrase in information theory) and for a family of distributions ("statistical model", a common phrase in statistics).

Definition 2 (Plug-in universal model): Let $\mathcal{M} = \{M_\mu \mid \mu \in \Theta_{\text{mean}}\}$ be an exponential family with mean value parameter domain $\Theta_{\text{mean}}$. Given $\mathcal{M}$, a constant $\hat\mu_0 \in \Theta_{\text{mean}}$ and a sequence of functions $\hat\mu(z^1), \hat\mu(z^2), \ldots$, such that $\hat\mu(z^n) =: \hat\mu_n \in \Theta_{\text{mean}}$, we define the plug-in universal model (or plug-in model for short) $U$ by setting, for all $n$, all $z^{n+1} \in \mathcal{Z}^{n+1}$:

$$U(z_{n+1} \mid z^n) = M_{\hat\mu_n}(z_{n+1}),$$

where $U(z_{n+1} \mid z^n)$ is the density/mass function of $z_{n+1}$ conditional on $Z^n = z^n$.

We usually refer to the plug-in universal model in terms of the codelength function of the corresponding plug-in universal code:

$$L_U(z^n) = \sum_{i=0}^{n-1} -\ln U(z_{i+1} \mid z^i) = \sum_{i=0}^{n-1} -\ln M_{\hat\mu_i}(z_{i+1}). \tag{1}$$
The most important plug-in model is the ML (maximum 
likelihood) plug-in model, defined as follows: 

Definition 3 (ML plug-in model): Given $\mathcal{M}$ and constants $x_0 \in \Theta_{\text{mean}}$ and $n_0 > 0$, we define the ML plug-in model $U$ by setting, for all $n$, all $z^{n+1} \in \mathcal{Z}^{n+1}$:

$$U(z_{n+1} \mid z^n) = M_{\hat\mu(z^n)}(z_{n+1}),$$

where

$$\hat\mu(z^n) = \hat\mu_n := \frac{x_0 \cdot n_0 + \sum_{i=1}^n x_i}{n + n_0}. \tag{2}$$

To understand this definition, note that for exponential families, for any sequence of data, the ordinary maximum likelihood parameter is given by the average $n^{-1}\sum_{i=1}^n x_i$ of the observed values of $X$ [12]. Here we define our plug-in model in terms of a slightly modified maximum likelihood estimator that introduces a 'fake initial outcome' $x_0$ with multiplicity $n_0$ in order to avoid infinite code lengths for the first few outcomes (a well-known problem sometimes called the "inherent singularity" of predictive coding [7], [1]) and to ensure that the plug-in ML code of the first outcome is well-defined. In practice we can take $n_0 = 1$ but our result holds for any $n_0 > 0$.
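The definitions (1) and (2) can be sketched in a few lines of code. The example below is a minimal illustration for the Poisson model in its mean value parameterization; the function names and the default choices $x_0 = 1$, $n_0 = 1$ are assumptions made here, not part of the definitions.

```python
import math

def ml_plugin_estimates(xs, x0=1.0, n0=1.0):
    """Return mu_hat_0, ..., mu_hat_{n-1} from (2): the estimate used to code each outcome."""
    estimates, total = [], x0 * n0
    for n, x in enumerate(xs):
        estimates.append(total / (n + n0))  # depends only on x_1, ..., x_n
        total += x
    return estimates

def poisson_neg_log(x, mu):
    """-ln M_mu(x) in nats for the Poisson distribution with mean mu."""
    return mu - x * math.log(mu) + math.lgamma(x + 1)

def ml_plugin_codelength(xs, x0=1.0, n0=1.0):
    """Codelength (1) of the ML plug-in code for the Poisson model."""
    estimates = ml_plugin_estimates(xs, x0, n0)
    return sum(poisson_neg_log(x, mu) for x, mu in zip(xs, estimates))
```

For the sequence `[0, 0, 3]` the estimates are 1, 1/2 and 1/3. Without the fake initial outcome the estimate after two zeros would be 0, and coding the outcome 3 would cost infinitely many nats; this is the singularity that $x_0$ and $n_0$ are introduced to avoid.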



Definition 4 (Relative redundancy): Following [13], [8], we define the relative redundancy with respect to $P$ of a code $U$ that is universal on a model $\mathcal{M}$ as:

$$\mathcal{R}_U(n) := E_P[L_U(Z^n)] - \inf_{\mu \in \Theta_{\text{mean}}} E_P[-\ln M_\mu(Z^n)], \tag{3}$$

where $L_U$ is the length function of $U$.

We use the term relative redundancy rather than just redundancy to emphasize that it measures redundancy relative to the element of the model that minimizes the codelength rather than to $P$, which is not necessarily an element of the model. From now on, we only consider $P$ under which the data are i.i.d. Under this condition, let $M_{\mu^*}$ be the element of $\mathcal{M}$ that minimizes KL divergence to $P$:

$$\mu^* := \arg\min_{\mu \in \Theta_{\text{mean}}} D(P \| M_\mu) = \arg\min_{\mu \in \Theta_{\text{mean}}} E_P[-\ln M_\mu(Z)],$$

where the equality follows from the definition of KL divergence. If $M_{\mu^*}$ exists, it is unique, and if $E_P[X] \in \Theta_{\text{mean}}$, then $\mu^* = E_P[X]$ [1, Ch. 17], and the relative redundancy satisfies

$$\mathcal{R}_U(n) = E_P[L_U(Z^n)] - E_P[-\ln M_{\mu^*}(Z^n)]. \tag{4}$$

III. First Result: Redundancy of Plug-In Codes 

The three major types of universal codes, Bayes, NML and 2-part, achieve relative redundancies that are (in an appropriate sense) close to optimal. Specifically, under the conditions on $\mathcal{M}$ described above, and if data are i.i.d. $\sim P$, then, under some mild conditions on $P$, these universal codes satisfy:

$$\mathcal{R}_U(n) = \frac{1}{2}\ln n + O(1), \tag{5}$$

(where the $O(1)$ may depend on $\mu^*$ and the universal code used), whether $P \in \mathcal{M}$ or $P \notin \mathcal{M}$. (5) is the famous '$k$ over 2 log $n$ formula' ($k = 1$ in our case), refinements of which lie at the basis of practical approximations to MDL learning [1].

While it is known that for $P \in \mathcal{M}$, the fourth major type of universal code, the ML plug-in code, satisfies (5) as well, it was shown by [8], [9] that when $P$ is not in the model, the ML plug-in code may behave suboptimally. Specifically, its relative redundancy satisfies:

$$\mathcal{R}_U(n) = \frac{1}{2}\frac{\mathrm{var}_P X}{\mathrm{var}_{M_{\mu^*}} X}\ln n + O(1), \tag{6}$$

and can be significantly larger than (5) when the variance of $P$ is large.
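For one special case the growth rate in (6) can be checked exactly, without simulation. Assume (for this illustration only) that $\mathcal{M}$ is the normal location family with fixed variance `var_m` and that $P$ has mean $\mu^*$ and variance `var_p`. Then $D(M_{\mu^*} \| M_\mu) = (\mu - \mu^*)^2 / (2\,\mathrm{var}_m)$, and by identity (8) of the proof sketch below the relative redundancy of the ML plug-in code is a sum of mean squared errors of the estimator (2), which has a closed form. The function name and default arguments are hypothetical.

```python
import math

def ml_plugin_redundancy_gaussian(n, var_p, var_m, mu_star=0.0, x0=0.0, n0=1.0):
    """Exact relative redundancy of the ML plug-in code, Gaussian location model.

    Uses R_U(n) = sum_i E[(mu_hat_i - mu_star)^2] / (2 * var_m), where mu_hat_i
    is the modified ML estimator (2) based on i outcomes with variance var_p.
    """
    total = 0.0
    for i in range(n):
        bias_sq = (n0 * (x0 - mu_star) / (i + n0)) ** 2  # effect of the fake outcome
        variance = i * var_p / (i + n0) ** 2             # sampling variance of mu_hat_i
        total += (bias_sq + variance) / (2.0 * var_m)
    return total
```

With `var_p = 4` and `var_m = 1`, the difference between this quantity and $\frac{1}{2} \cdot 4 \cdot \ln n$ settles to a constant as $n$ grows, so the redundancy is four times the $\frac{1}{2}\ln n$ of (5), as (6) predicts.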

In this paper, we show that not only the ML plug-in code, but every plug-in code may behave suboptimally when $P \notin \mathcal{M}$. In other words, modifying the ML estimator $\hat\mu_n$ or introducing any other sequence of estimators $\hat\mu_n$, and constructing the plug-in code based on that sequence, will not help to satisfy (5). Thus the optimal redundancy can only be achieved by codes outside $\mathcal{M}$, unless $\mathcal{M}$ is the Bernoulli family (since we assume the data are i.i.d., in the Bernoulli case we must have that $P \in \mathcal{M}$; but the Bernoulli case is the only case in which we must have $P \in \mathcal{M}$).



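Our main result, Theorem 1, concerns the case in which $P$ is itself a member of some exponential family $\mathcal{P}$, but $\mathcal{P}$ is in general different from $\mathcal{M}$. Then, the suboptimal behavior of plug-in codes follows immediately as Corollary 1, stated further below.

Theorem 1: Let $\mathcal{M} = \{M_\mu \mid \mu \in \Theta_{\text{mean}}\}$ and $\mathcal{P} = \{P_\mu \mid \mu \in \Theta_{\text{mean}}\}$ be single parameter exponential families with the same sufficient statistic $X$ and mean-value parameter space $\Theta_{\text{mean}}$. Let $U$ denote any plug-in model with respect to $\mathcal{M}$ based on the sequence of estimators $\hat\mu_0, \hat\mu_1, \hat\mu_2, \ldots$. Then, for Lebesgue almost all $\mu^* \in \Theta_{\text{mean}}$ (i.e. all apart from a Lebesgue measure zero set), for $X, X_1, X_2, \ldots$ i.i.d. $\sim P_{\mu^*} \in \mathcal{P}$:

$$\liminf_{n \to \infty} \frac{\mathcal{R}_U(n)}{\frac{1}{2}\ln n} \geq \frac{\mathrm{var}_{P_{\mu^*}} X}{\mathrm{var}_{M_{\mu^*}} X}.$$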

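Proof: (rough sketch; a detailed proof is in the Appendix) The proof is based on a theorem stated by Rissanen [14] (see also [1], Theorem 14.2), a special case of which says the following. Let $\Theta_0 \subset \Theta_{\text{mean}}$ be a closed, non-degenerate interval, $\mathcal{P}$ be defined as above, $P_\mu^{(n)}$ be a joint distribution of $n$ outcomes generated i.i.d. from $P_\mu$, $Q$ be an arbitrary probabilistic source, i.e. a distribution on infinite sequences $z_1, z_2, \ldots \in \mathcal{Z}^\infty$, and let $Q^{(n)}$ be its restriction to the first $n$ outcomes. Define:

$$g_n(\mu^*) = \frac{D(P_{\mu^*}^{(n)} \| Q^{(n)})}{\frac{1}{2}\ln n}.$$

Then for Lebesgue almost all $\mu^* \in \Theta_0$, $\liminf_{n \to \infty} g_n(\mu^*) \geq 1$.

We apply Rissanen's theorem by constructing a source $Q$, specifying the conditional probabilities $Q(z_{n+1} \mid z^n) := P_{\hat\mu_n}(z_{n+1})$, for every $n \geq 1$. We now have:

$$D(P_{\mu^*}^{(n)} \| Q^{(n)}) = \sum_{i=0}^{n-1} E_{P_{\mu^*}}\left[\ln P_{\mu^*}(Z_{i+1}) - \ln Q(Z_{i+1} \mid Z^i)\right] = \sum_{i=0}^{n-1} E_{P_{\mu^*}}\left[D(P_{\mu^*} \| P_{\hat\mu_i})\right]. \tag{7}$$

To see how (7) is related to our case, let us first rewrite the redundancy in a more convenient form:

$$\mathcal{R}_U(n) = \sum_{i=0}^{n-1} E_{P_{\mu^*}}\left[D(M_{\mu^*} \| M_{\hat\mu_i})\right]. \tag{8}$$

The derivation of (8) makes use of a standard result in the theory of exponential families and can be found e.g. in [1].

Comparing (7) and (8), we see that although in both expressions the expectation is taken with respect to $P_{\mu^*}$, (7) is a statement about KL divergence between the members of $\mathcal{P}$, while (8) speaks about the members of $\mathcal{M}$. The trick, which allows us to relate both expressions, is to examine their second-order behavior. By expanding $D(P_{\mu^*} \| P_{\hat\mu_i})$ into a Taylor series around $\mu^*$, we get:

$$D(P_{\mu^*} \| P_{\hat\mu_i}) \approx 0 + D^{(1)}(\mu^*)(\hat\mu_i - \mu^*) + \frac{1}{2} D^{(2)}(\mu^*)(\hat\mu_i - \mu^*)^2,$$

where we abbreviated $D^{(k)}(\mu) = \frac{d^k}{d\mu^k} D(P_{\mu^*} \| P_\mu)$. The term $D^{(1)}(\mu^*)$ is zero, since $D(\mu^* \| \mu)$ as a function of $\mu$ has its minimum at $\mu = \mu^*$ [12]. As is well-known [12], for exponential families the term $D^{(2)}(\mu)$ coincides precisely with the Fisher information $I_{\mathcal{P}}(\mu)$ evaluated at $\mu$. Another standard result [12] for the mean-value parameterization says that for all $\mu$, $I_{\mathcal{P}}(\mu) = 1 / \mathrm{var}_{P_\mu} X$. Therefore, we get

$$D(P_{\mu^*} \| P_{\hat\mu_i}) \approx \frac{1}{2}\frac{(\hat\mu_i - \mu^*)^2}{\mathrm{var}_{P_{\mu^*}} X}, \quad \text{and similarly,} \quad D(M_{\mu^*} \| M_{\hat\mu_i}) \approx \frac{1}{2}\frac{(\hat\mu_i - \mu^*)^2}{\mathrm{var}_{M_{\mu^*}} X},$$

so that

$$\frac{D(M_{\mu^*} \| M_{\hat\mu_i})}{D(P_{\mu^*} \| P_{\hat\mu_i})} \approx \frac{\mathrm{var}_{P_{\mu^*}} X}{\mathrm{var}_{M_{\mu^*}} X},$$

and using (7) and (8):

$$\mathcal{R}_U(n) \approx \frac{\mathrm{var}_{P_{\mu^*}} X}{\mathrm{var}_{M_{\mu^*}} X}\, D(P_{\mu^*}^{(n)} \| Q^{(n)}).$$

The last step of the proof is to use Rissanen's theorem and conclude that

$$\liminf_{n \to \infty} \frac{\mathcal{R}_U(n)}{\frac{1}{2}\ln n} \quad \text{is equal to} \quad \frac{\mathrm{var}_{P_{\mu^*}} X}{\mathrm{var}_{M_{\mu^*}} X} \liminf_{n \to \infty} \frac{D(P_{\mu^*}^{(n)} \| Q^{(n)})}{\frac{1}{2}\ln n} \geq \frac{\mathrm{var}_{P_{\mu^*}} X}{\mathrm{var}_{M_{\mu^*}} X}$$

for Lebesgue almost all $\mu^* \in \Theta_0$, and thus for Lebesgue almost all $\mu^* \in \Theta_{\text{mean}}$. ■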
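The second-order step of the sketch can be checked numerically for a concrete family. The example below uses the Poisson family in its mean value parameterization, for which $D(M_{\mu^*} \| M_\mu) = \mu - \mu^* + \mu^* \ln(\mu^*/\mu)$ and $\mathrm{var}_{M_\mu} X = \mu$; the helper names and the finite-difference step size are assumptions made for this illustration.

```python
import math

def kl_poisson(mu_star, mu):
    """D(M_{mu_star} || M_mu) in nats for Poisson distributions with the given means."""
    return mu - mu_star + mu_star * math.log(mu_star / mu)

def second_derivative(f, x, h=1e-3):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

def quadratic_approximation(mu_star, mu):
    """(1/2) * (mu - mu_star)^2 / var, with var = mu_star for the Poisson family."""
    return 0.5 * (mu - mu_star) ** 2 / mu_star
```

At $\mu^* = 3$ the finite-difference second derivative of $\mu \mapsto D(M_{\mu^*} \| M_\mu)$ is close to $1/3$, the inverse variance, and for $\mu$ near $\mu^*$ the divergence is close to the quadratic approximation used in the sketch.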

We now use Theorem 1 to show that the redundancy of plug-in codes is suboptimal for all exponential families which satisfy the following very weak condition:

Condition 1: Let $\mathcal{M} = \{M_\mu \mid \mu \in \Theta_{\text{mean}}\}$ be a single parameter exponential family with sufficient statistic $X$ and mean-value parameter space $\Theta_{\text{mean}}$. We require that there exists another single-parameter exponential family $\mathcal{P} = \{P_\mu \mid \mu \in \Theta_{\text{mean}}\}$ with the same mean-value parameter space as $\mathcal{M}$, but with strictly larger variance than $\mathcal{M}$ for every $\mu \in \Theta_{\text{mean}}$.

Condition 1 is widely satisfied among known exponential families. When $\mathcal{X} = [a, b]$, we define $P_\mu$ to be a "scaled" Bernoulli model, by putting all probability mass on $\{a, b\}$ in such a way that $E_{P_\mu}[X] = \mu$. It is easy to show that such a distribution has the highest variance among all distributions defined on $[a, b]$ with a given mean value $\mu$; therefore $\mathrm{var}_{P_\mu} X > \mathrm{var}_{M_\mu} X$, unless $\mathcal{M}$ is a "scaled" Bernoulli itself. When $\mathcal{X} = \mathbb{R}$, $\mathcal{P}$ can be chosen to be a normal family with fixed, sufficiently large variance $\sigma^2$. For $\mathcal{X} = [0, \infty)$, $\mathcal{P}$ can be taken to be a gamma family with sufficiently large scale parameter. When $\mathcal{X} = \{0, 1, 2, \ldots\}$, $\mathcal{P}$ can be taken to be negative binomial (with expected "number of successes" sufficiently small).

Thus, we see that for all commonly used exponential families, except for Bernoulli, Condition 1 holds. On the other hand, if $\mathcal{M}$ is Bernoulli, Corollary 1 is no longer relevant anyway, since then $P$ must lie in $\mathcal{M}$.
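To make the last example concrete, the sketch below compares, for the same mean, the variance of a Poisson distribution with that of a negative binomial distribution on $\{0, 1, 2, \ldots\}$. The parameterization by the mean `mu` and a "number of successes" `r`, the truncation of the support, and all function names are assumptions made for this illustration.

```python
import math

def pmf_moments(log_pmf, support_size=2000):
    """Mean and variance of a distribution on {0, 1, ...}, by truncated summation."""
    probs = [math.exp(log_pmf(k)) for k in range(support_size)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var

def poisson_log_pmf(mu):
    return lambda k: -mu + k * math.log(mu) - math.lgamma(k + 1)

def negbin_log_pmf(mu, r):
    """Negative binomial with r 'successes', re-parameterized to have mean mu."""
    p = r / (r + mu)  # success probability giving mean r * (1 - p) / p = mu
    return lambda k: (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
                      + r * math.log(p) + k * math.log(1.0 - p))
```

For mean 2 the Poisson variance is 2, while the negative binomial with `r = 0.5` has variance $\mu + \mu^2 / r = 10$; smaller `r` gives a larger variance at the same mean, which is why a sufficiently small `r` is required in general.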

Corollary 1: Let $\mathcal{M} = \{M_\mu \mid \mu \in \Theta_{\text{mean}}\}$ be a single parameter exponential family with sufficient statistic $X$ and mean-value parameter space $\Theta_{\text{mean}}$, satisfying Condition 1. Let $U$ denote any plug-in model with respect to $\mathcal{M}$ based on any sequence of estimators $\hat\mu_1, \hat\mu_2, \ldots$. Then, there exists a family of distributions $\mathcal{P} = \{P_\mu \mid \mu \in \Theta_{\text{mean}}\}$, such that for Lebesgue almost all $\mu^* \in \Theta_{\text{mean}}$, for $X, X_1, X_2, \ldots$ i.i.d. $\sim P_{\mu^*}$:

$$\liminf_{n \to \infty} \frac{\mathcal{R}_U(n)}{\ln n} \geq \frac{1}{2}\frac{\mathrm{var}_{P_{\mu^*}} X}{\mathrm{var}_{M_{\mu^*}} X} > \frac{1}{2},$$

so that the set of $\mu^*$ for which $U$ achieves the regret $\frac{1}{2}\ln n + O(1)$ is a set of Lebesgue measure zero.

Proof: Immediate from Theorem 1. ■



IV. Second Result: Optimality of Squashed ML 

We showed that every plug-in code, including the ML plug-in code, behaves suboptimally for 1-parameter families $\mathcal{M}$ unless $\mathcal{M}$ is Bernoulli. This fact does not, however, exclude the possibility that a small modification of the ML plug-in code, which puts the predictions slightly outside $\mathcal{M}$, will lead to the optimal redundancy (5). An argument supporting this claim comes from considering the Bayesian predictive distribution when $\mathcal{M}$ is the normal family with fixed variance $\sigma^2$. In this case, the Bayesian code based on prior $\mathcal{N}(\mu_0, \tau_0^2)$ has a simple form [1]:

$$U_{\text{Bayes}}(z_{n+1} \mid z^n) = f_{\mu_n, \tau_n^2 + \sigma^2}(z_{n+1}),$$

where $f_{\mu, \sigma^2}$ is the density of the normal distribution $\mathcal{N}(\mu, \sigma^2)$,

$$\mu_n = \frac{\mu_0 \cdot \sigma^2/\tau_0^2 + \sum_{i=1}^n x_i}{n + \sigma^2/\tau_0^2}, \quad \text{and} \quad \tau_n^2 = \frac{\sigma^2}{n + \sigma^2/\tau_0^2}.$$

Thus, the Bayesian predictive distribution is itself a Gaussian with mean equal to the modified maximum likelihood estimator (with $n_0 = \sigma^2/\tau_0^2$), albeit with a slightly larger variance $\sigma^2 + O(1/n)$. This shows that for the normal family with fixed variance, there exists an "almost" in-model code which satisfies (5). This led [1] to conjecture that something similar holds for general exponential families. Here we show that this is indeed the case: we propose a simple modification of the ML plug-in universal model, obtained by predicting $z_{n+1}$ using a slightly "squashed" version $M'_{\hat\mu_n}$ of the ML estimator $M_{\hat\mu_n}$, defined as:

$$M'_{\hat\mu_n}(z_{n+1}) := M_{\hat\mu_n}(z_{n+1}) \, \frac{1 + \frac{1}{2n} I_{\mathcal{M}}(\hat\mu_n)(x_{n+1} - \hat\mu_n)^2}{1 + \frac{1}{2n}},$$

where $\hat\mu_n$ is defined as in (2) and $I_{\mathcal{M}}(\mu)$ is the Fisher information for model $\mathcal{M}$. Note that $M'_{\hat\mu_n}(\cdot)$ represents a valid probability density: it is non-negative due to $I_{\mathcal{M}}(\hat\mu_n) > 0$ (a property of exponential families), and it is properly normalized:

$$\int_{\mathcal{Z}} M'_{\hat\mu_n}(z)\,dz = \frac{1}{1 + \frac{1}{2n}} \left( \int_{\mathcal{Z}} M_{\hat\mu_n}(z)\,dz + \frac{1}{2n} I_{\mathcal{M}}(\hat\mu_n) \int_{\mathcal{Z}} (X(z) - \hat\mu_n)^2 M_{\hat\mu_n}(z)\,dz \right) = 1,$$

where the final equality follows because for exponential families, $I_{\mathcal{M}}(\mu) = (\mathrm{var}_{M_\mu} X)^{-1}$. While $M' \notin \mathcal{M}$, we have $D(M'_{\hat\mu_n} \| M_{\hat\mu_n}) = O(1/n)$, i.e. $M'$ is an "almost" in-model estimator.
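The normalization argument can be verified numerically for a concrete family. The sketch below implements the squashed density for the Poisson model, where $I_{\mathcal{M}}(\mu) = 1/\mathrm{var}_{M_\mu} X = 1/\mu$; the function names and the truncation of the support in the check are assumptions made for this illustration.

```python
import math

def poisson_pmf(x, mu):
    return math.exp(-mu + x * math.log(mu) - math.lgamma(x + 1))

def squashed_poisson_pmf(x, mu_hat, n):
    """Slightly squashed ML estimator for the Poisson model, for n >= 1."""
    fisher = 1.0 / mu_hat  # I_M(mu) = 1 / var_{M_mu} X = 1 / mu for Poisson
    factor = (1.0 + fisher * (x - mu_hat) ** 2 / (2.0 * n)) / (1.0 + 1.0 / (2.0 * n))
    return poisson_pmf(x, mu_hat) * factor

def total_mass(mu_hat, n, support_size=500):
    """Sum of the squashed probabilities over a truncated support."""
    return sum(squashed_poisson_pmf(x, mu_hat, n) for x in range(support_size))
```

The squashing factor exceeds 1 for outcomes more than one model standard deviation away from $\hat\mu_n$ and is below 1 otherwise, so mass is moved from the centre to the tails while the total stays equal to 1.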

Definition 5 (Squashed ML prequential model): Given $\mathcal{M}$, constants $x_0 \in \Theta_{\text{mean}}$ and $n_0 > 0$, we define the slightly squashed ML prequential model $U$ by setting, for all $n$, all $z^{n+1} \in \mathcal{Z}^{n+1}$:

$$U(z_{n+1} \mid z^n) = M'_{\hat\mu_n}(z_{n+1}),$$

where $M'$ is the slightly squashed ML estimator as above.

The codelengths of the corresponding slightly squashed ML prequential code are not harder to calculate than those of the ordinary ML plug-in model, and in some cases they are easier to calculate than the lengths of the Bayesian universal code. On the other hand, we show below that the slightly squashed ML code always achieves the optimal redundancy, satisfying (5).
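The claim that the squashed codelengths are as easy to compute as the ML plug-in codelengths can be illustrated as follows for the Poisson model, where $I_{\mathcal{M}}(\mu) = 1/\mu$. This is a minimal sketch: the function names are hypothetical, and because the factor $1/(2n)$ is undefined at $n = 0$, the sketch assumes that the first outcome is coded with the unsquashed estimator.

```python
import math

def poisson_neg_log(x, mu):
    """-ln M_mu(x) in nats for the Poisson distribution with mean mu."""
    return mu - x * math.log(mu) + math.lgamma(x + 1)

def codelengths(xs, x0=1.0, n0=1.0):
    """Return (ML plug-in, slightly squashed ML) codelengths in nats."""
    total, ml_len, squashed_len = x0 * n0, 0.0, 0.0
    for n, x in enumerate(xs):
        mu_hat = total / (n + n0)          # modified ML estimator (2)
        base = poisson_neg_log(x, mu_hat)  # cost under the in-model prediction
        ml_len += base
        if n == 0:
            # Assumption of this sketch: no squashing before any data is seen.
            squashed_len += base
        else:
            v = (x - mu_hat) ** 2 / (mu_hat * 2.0 * n)  # (1/2n) * I_M * (x - mu_hat)^2
            squashed_len += base + math.log1p(1.0 / (2.0 * n)) - math.log1p(v)
        total += x
    return ml_len, squashed_len
```

The only extra work per outcome is two logarithms. Outcomes close to the current estimate cost slightly more under the squashed code, while surprising outcomes cost less.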

Theorem 2: Let $X, X_1, X_2, \ldots$ be i.i.d. $\sim P$, with $E_P[X] = \mu^*$. Let $\mathcal{M}$ be a single parameter exponential family with sufficient statistic $X$ and $\mu^*$ an element of the mean value parameter space. Let $U$ denote the slightly squashed ML model with respect to $\mathcal{M}$. If $\mathcal{M}$ and $P$ satisfy Condition 2 below, then:

$$\mathcal{R}_U(n) = \frac{1}{2}\ln n + O(1). \tag{9}$$

Condition 2: We require that the following holds both for $T := X$ and $T := -X$:

• If $T$ is unbounded from above then there is a $k \in \{4, 6, \ldots\}$ such that the first $k$ moments of $T$ exist under $P$, that $\frac{d^2}{d\mu^2} I_{\mathcal{M}}(\mu) = O(\mu^{k-4})$, $\frac{d^4}{d\mu^4} D(M_{\mu^*} \| M_\mu) = O(\mu^{k-6})$ and that either $I_{\mathcal{M}}(\mu)$ is constant or $I_{\mathcal{M}}(\mu) = O(\mu^{k/2-3})$.

• If $T$ is bounded from above by a constant $g$ then $\frac{d^2}{d\mu^2} I_{\mathcal{M}}(\mu)$, $\frac{d^4}{d\mu^4} D(M_{\mu^*} \| M_\mu)$, and $I_{\mathcal{M}}(\mu)$ are polynomial in $1/(g - \mu)$.

The usefulness of Theorem 2 depends on the validity of Condition 2 among commonly used exponential families. As can be seen from Figure 1, for some standard exponential families our condition applies whenever the fourth moment of $P$ exists.

Fig. 1. Fisher information, its second derivative and the fourth derivative of the divergence for a number of exponential families. For the normal distribution with fixed mean we use mean 0, and the density of the squared outcomes is given as a function of the variance.

Proof: (of Theorem 2; rough sketch, a detailed proof is in the Appendix) We express the relative redundancy of the slightly squashed ML plug-in code $U$ as the sum of the relative redundancy of the ordinary ML plug-in code $\hat{U}$ and the difference in expected codelengths between $U$ and $\hat{U}$:

$$\mathcal{R}_U(n) = E_P[L_U(Z^n)] - E_P[-\ln M_{\mu^*}(Z^n)] = E_P[L_U(Z^n) - L_{\hat U}(Z^n)] + \mathcal{R}_{\hat U}(n) = E_P[L_U(Z^n) - L_{\hat U}(Z^n)] + \frac{1}{2}\frac{\mathrm{var}_P X}{\mathrm{var}_{M_{\mu^*}} X}\ln n + O(1),$$

where the last equality follows from (6). We have:

$$L_U(Z^n) - L_{\hat U}(Z^n) = \sum_{i=0}^{n-1} \left( -\ln U(Z_{i+1} \mid Z^i) + \ln \hat U(Z_{i+1} \mid Z^i) \right) = \sum_{i=0}^{n-1} \left( \ln\left(1 + \frac{1}{2i}\right) - \ln\left(1 + \frac{1}{2i} I_{\mathcal{M}}(\hat\mu_i)(X_{i+1} - \hat\mu_i)^2\right) \right).$$

Since $\ln(1 + \frac{1}{2i}) = \frac{1}{2i} + O(i^{-2})$, we get $\sum_i \ln(1 + \frac{1}{2i}) = \frac{1}{2}\ln n + O(1)$. Denoting $V_i = \frac{1}{2i} I_{\mathcal{M}}(\hat\mu_i)(X_{i+1} - \hat\mu_i)^2$, we also get $\ln(1 + V_i) = V_i + O(i^{-2})$. Next, we consider $E_P[V_i]$:

$$E_P[V_i] = \frac{1}{2i} E_P\left[I_{\mathcal{M}}(\hat\mu_i)(X_{i+1} - \mu^* + \mu^* - \hat\mu_i)^2\right] = \frac{1}{2i} E_P\left[I_{\mathcal{M}}(\hat\mu_i)\left(\mathrm{var}_P X + (\mu^* - \hat\mu_i)^2\right)\right] = \frac{1}{2i}\left(\mathrm{var}_P X \cdot E_P[I_{\mathcal{M}}(\hat\mu_i)] + E_P\left[I_{\mathcal{M}}(\hat\mu_i)(\mu^* - \hat\mu_i)^2\right]\right).$$

The second term $E_P[I_{\mathcal{M}}(\hat\mu_i)(\mu^* - \hat\mu_i)^2]$ is $O(i^{-1})$, as $E_P[(\mu^* - \hat\mu_i)^2] = O(i^{-1})$ and $E[I_{\mathcal{M}}(\hat\mu_i)] = I_{\mathcal{M}}(\mu^*) + O(i^{-1})$ (this follows from expanding $I_{\mathcal{M}}(\hat\mu_i)$ up to the first order around $\mu^*$). Similarly, the first term is $(\mathrm{var}_P X)\, I_{\mathcal{M}}(\mu^*) + O(i^{-1})$. Thus, using $I_{\mathcal{M}}(\mu^*) = 1/\mathrm{var}_{M_{\mu^*}} X$, we finally get:

$$E_P[-\ln(1 + V_i)] = -E_P[V_i] + O(i^{-2}) = -\frac{1}{2i}\frac{\mathrm{var}_P X}{\mathrm{var}_{M_{\mu^*}} X} + O(i^{-2}).$$

Taking all together, we see that the terms $\frac{1}{2}\frac{\mathrm{var}_P X}{\mathrm{var}_{M_{\mu^*}} X}\ln n$ cancel and we finally get $\mathcal{R}_U(n) = \frac{1}{2}\ln n + O(1)$. Condition 2 is necessary to ensure that all Taylor expansions above hold. ■
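For readers who want the cancellation written out, the steps of the sketch can be collected as follows. This is only a restatement of the sketch under its own approximations; the sums are taken over $i \geq 1$, which is an assumption made here because the factor $1/(2i)$ is undefined at $i = 0$, and $\rho$ is shorthand introduced for this display.

```latex
\begin{align*}
\rho &:= \frac{\mathrm{var}_P X}{\mathrm{var}_{M_{\mu^*}} X}, \\
E_P\bigl[L_U(Z^n) - L_{\hat U}(Z^n)\bigr]
  &= \sum_{i=1}^{n-1} \ln\Bigl(1 + \frac{1}{2i}\Bigr)
     + \sum_{i=1}^{n-1} E_P\bigl[-\ln(1 + V_i)\bigr] \\
  &= \Bigl(\tfrac{1}{2}\ln n + O(1)\Bigr)
     + \sum_{i=1}^{n-1} \Bigl(-\frac{\rho}{2i} + O(i^{-2})\Bigr) \\
  &= \tfrac{1}{2}\ln n - \tfrac{\rho}{2}\ln n + O(1), \\
\mathcal{R}_U(n)
  &= E_P\bigl[L_U(Z^n) - L_{\hat U}(Z^n)\bigr] + \tfrac{\rho}{2}\ln n + O(1)
   = \tfrac{1}{2}\ln n + O(1).
\end{align*}
```

The factor $\rho$ that makes the ML plug-in code suboptimal in (6) is thus removed exactly by the expected gain from squashing.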

V. Future Work 

In future work, we hope to extend our results concerning the slightly squashed ML estimator to the multi-parameter case and to establish an almost-sure variant of Theorem 2. We also plan to analyze the estimator in the individual sequence framework, along the lines of [15], [16].
Appendix
Proof of Theorem 1

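Before we show the main result, we need to prove the following lemmas.

Lemma 3: Let $\mathcal{M} = \{M_\mu \mid \mu \in \Theta_{\text{mean}}\}$ and $\mathcal{P} = \{P_\mu \mid \mu \in \Theta_{\text{mean}}\}$ be single parameter exponential families with the same sufficient statistic $X$ and mean-value parameter space $\Theta_{\text{mean}}$. Let $\Theta_0 \subset \Theta_{\text{mean}}$ be any non-degenerate closed interval. Let $X, X_1, X_2, \ldots$ be i.i.d. $\sim P_{\mu^*}$ for some $\mu^* \in \Theta_0$. Let $\hat\mu_0, \hat\mu_1, \hat\mu_2, \ldots$ be a sequence of estimators, such that $\hat\mu_i = \hat\mu_i(z^i)$ and $\hat\mu_i \in \Theta_0$ for all $i \geq 1$. Then, for Lebesgue almost all $\mu^* \in \Theta_0$:

$$\liminf_{n \to \infty} \frac{\sum_{i=0}^{n-1} E_{P_{\mu^*}}\left[(\hat\mu_i - \mu^*)^2\right]}{\ln n} \geq \underline{V}_{\mathcal{P}},$$

where $\underline{V}_{\mathcal{P}} := \inf_{\mu \in \Theta_0} \mathrm{var}_{P_\mu} X$.

Proof: The proof is based on a theorem stated by Rissanen [14] (see also [1], Theorem 14.2), a special case of which says the following.

Let $\mathcal{P}$ and $\Theta_0$ be defined as above, $P_\mu^{(n)}$ be a joint distribution of $n$ outcomes generated i.i.d. from $P_\mu$, $Q$ be an arbitrary probabilistic source, i.e. a distribution on infinite sequences $z_1, z_2, \ldots \in \mathcal{Z}^\infty$, and let $Q^{(n)}$ be its restriction to the first $n$ outcomes (marginalized over $z_{n+1}, z_{n+2}, \ldots$). Define:

$$g_n(\mu^*) = \inf_{n' \geq n} \frac{D(P_{\mu^*}^{(n')} \| Q^{(n')})}{\frac{1}{2}\ln n'}. \tag{10}$$

Then for Lebesgue almost all $\mu^* \in \Theta_0$, $\lim_{n \to \infty} g_n(\mu^*) \geq 1$.

We construct the source $Q$ by specifying the conditional probabilities:

$$Q(z_{n+1} \mid z^n) := P_{\hat\mu_n}(z_{n+1}),$$

for every $n \geq 1$. This definition is valid, because $\hat\mu_n$ depends only on $z^n$. Now, we have:

$$D(P_{\mu^*}^{(n)} \| Q^{(n)}) = E_{Z^n \sim P_{\mu^*}^{(n)}}\left[\ln P_{\mu^*}^{(n)}(Z^n) - \ln Q^{(n)}(Z^n)\right] = \sum_{i=0}^{n-1} E_{P_{\mu^*}}\left[D(P_{\mu^*} \| P_{\hat\mu_i})\right].$$

Expanding $D(P_{\mu^*} \| P_{\hat\mu_i})$ into a Taylor series around $\mu^*$ yields:

$$D(P_{\mu^*} \| P_{\hat\mu_i}) = 0 + D^{(1)}(\mu^*)(\hat\mu_i - \mu^*) + \frac{1}{2} D^{(2)}(\mu)(\hat\mu_i - \mu^*)^2,$$

for some $\mu$ between $\hat\mu_i$ and $\mu^*$, where we abbreviated $D^{(k)}(\mu) = \frac{d^k}{d\mu^k} D(P_{\mu^*} \| P_\mu)$. The term $D^{(1)}(\mu^*)$ is zero, since $D(\mu^* \| \mu)$ as a function of $\mu$ has its minimum at $\mu = \mu^*$ [12]. As is well-known [12], for exponential families the term $D^{(2)}(\mu)$ coincides precisely with the Fisher information $I_{\mathcal{P}}(\mu)$ evaluated at $\mu$. Another standard result [12] for the mean-value parameterization says that for all $\mu$,

$$I_{\mathcal{P}}(\mu) = \frac{1}{\mathrm{var}_{P_\mu} X}. \tag{11}$$

Therefore (using the shorter notation $E_{P_{\mu^*}}$ for $E_{Z^n \sim P_{\mu^*}^{(n)}}$):

$$D(P_{\mu^*}^{(n)} \| Q^{(n)}) = \frac{1}{2} \sum_{i=0}^{n-1} E_{P_{\mu^*}}\left[\frac{(\hat\mu_i - \mu^*)^2}{\mathrm{var}_{P_\mu} X}\right] \leq \frac{1}{2 \underline{V}_{\mathcal{P}}} \sum_{i=0}^{n-1} E_{P_{\mu^*}}\left[(\hat\mu_i - \mu^*)^2\right]. \tag{12}$$

Note that $\underline{V}_{\mathcal{P}} > 0$ is an infimum of a continuous and positive function on a compact set. From (10) and (12) we have:

$$\inf_{n' \geq n} \frac{\frac{1}{2 \underline{V}_{\mathcal{P}}} \sum_{i=0}^{n'-1} E_{P_{\mu^*}}\left[(\hat\mu_i - \mu^*)^2\right]}{\frac{1}{2}\ln n'} \geq g_n(\mu^*),$$

and thus Rissanen's theorem proves the lemma. ■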
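Lemma 4: Let $\mathcal{M}, \mathcal{P}, \Theta_0, X_1, X_2, \ldots$ be defined as in Lemma 3. Let $U$ denote any plug-in model with respect to $\mathcal{M}$ based on a sequence of estimators $\hat\mu_1, \hat\mu_2, \ldots$ (notice that now we do not restrict $\hat\mu_i$ to be in $\Theta_0$, as in Lemma 3). Then, for Lebesgue almost all $\mu^* \in \Theta_0$:

$$\liminf_{n \to \infty} \frac{\sum_{i=0}^{n-1} E_{P_{\mu^*}}\left[D(M_{\mu^*} \| M_{\hat\mu_i})\right]}{\ln n} \geq \frac{1}{2}\frac{\underline{V}_{\mathcal{P}}}{\overline{V}_{\mathcal{M}}},$$

for $\underline{V}_{\mathcal{P}} := \inf_{\mu \in \Theta_0} \mathrm{var}_{P_\mu} X$ and $\overline{V}_{\mathcal{M}} := \sup_{\mu \in \Theta_0} \mathrm{var}_{M_\mu} X$.

Proof: Let us denote $\Theta_0 = [\mu_0, \mu_1]$. We define a truncated sequence of estimators $(\hat\mu'_i)$ as follows:

$$\hat\mu'_i = \begin{cases} \mu_1 & \text{if } \hat\mu_i > \mu_1, \\ \hat\mu_i & \text{if } \mu_0 \leq \hat\mu_i \leq \mu_1, \\ \mu_0 & \text{if } \hat\mu_i < \mu_0, \end{cases}$$

so that $\hat\mu'_i \in \Theta_0$. Note that $D(M_{\mu^*} \| M_{\hat\mu_i}) \geq D(M_{\mu^*} \| M_{\hat\mu'_i})$, as there exists $\lambda \in [0, 1]$ such that we can express $\hat\mu'_i = \lambda\mu^* + (1 - \lambda)\hat\mu_i$, and $D(M_{\mu^*} \| M_{\lambda\mu^* + (1-\lambda)\hat\mu_i})$ is strictly decreasing in $\lambda$ [1]. Using this fact and expanding $D(M_{\mu^*} \| M_{\hat\mu'_i})$ into a Taylor series as in Lemma 3, we get:

$$E_{P_{\mu^*}}\left[D(M_{\mu^*} \| M_{\hat\mu_i})\right] \geq E_{P_{\mu^*}}\left[D(M_{\mu^*} \| M_{\hat\mu'_i})\right] = \frac{1}{2} E_{P_{\mu^*}}\left[\frac{(\hat\mu'_i - \mu^*)^2}{\mathrm{var}_{M_\mu} X}\right] \geq \frac{1}{2 \overline{V}_{\mathcal{M}}} E_{P_{\mu^*}}\left[(\hat\mu'_i - \mu^*)^2\right].$$

Summing over $i = 0, \ldots, n - 1$ and using Lemma 3 finishes the proof. ■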
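The truncation step can be illustrated numerically. The sketch below uses the Poisson family, for which $D(M_{\mu^*} \| M_\mu) = \mu - \mu^* + \mu^* \ln(\mu^*/\mu)$, and checks that clipping an estimate to an interval containing $\mu^*$ never increases the divergence; the function names and the example interval are assumptions made for this illustration.

```python
import math

def kl_poisson(mu_star, mu):
    """D(M_{mu_star} || M_mu) in nats for Poisson distributions with the given means."""
    return mu - mu_star + mu_star * math.log(mu_star / mu)

def truncate(mu_hat, lower, upper):
    """Truncated estimator: clip mu_hat to the interval [lower, upper]."""
    return min(max(mu_hat, lower), upper)

def divergence_pair(mu_star, mu_hat, lower, upper):
    """Divergences of the original and of the truncated estimate from mu_star."""
    return kl_poisson(mu_star, mu_hat), kl_poisson(mu_star, truncate(mu_hat, lower, upper))
```

For $\mu^* = 2$ and the interval $[1, 3]$, an estimate of 7 is clipped to 3, which lies between 7 and $\mu^*$ and is therefore closer to $M_{\mu^*}$ in KL divergence; estimates already inside the interval are left unchanged.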
Before we prove Theorem 1 we further need a simple lemma to rewrite the redundancy in a more convenient form:

Lemma 5: Let $U$ and $\mathcal{M}$ be defined as in Theorem 1. We have:

$$\mathcal{R}_U(n) = \sum_{i=0}^{n-1} E_{P_{\mu^*}}\left[D(M_{\mu^*} \| M_{\hat\mu_i})\right].$$

The usefulness of this lemma comes from the fact that the KL divergence $D(\cdot \| \cdot)$ is defined as an expectation over $M_{\mu^*}$ rather than $P_{\mu^*}$. The proof makes use of a standard result in the theory of exponential families and can be found e.g. in [1] (see also the related Lemma 1 in [9]).
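The standard result in question can be sketched in a few lines; this is an outline under the notation of Definition 1, not the proof given in [1]. Write $\eta^*$ and $\eta$ for the natural parameters corresponding to $\mu^*$ and $\mu$. The log-likelihood ratio between two members of $\mathcal{M}$ is linear in $X$, so its expectation depends on the sampling distribution only through the mean of $X$:

```latex
\begin{align*}
\ln M_{\mu^*}(z) - \ln M_{\mu}(z)
  &= -(\eta^* - \eta)\, X(z) - \ln Z(\eta^*) + \ln Z(\eta), \\
E_{P_{\mu^*}}\bigl[\ln M_{\mu^*}(Z) - \ln M_{\mu}(Z)\bigr]
  &= -(\eta^* - \eta)\, \mu^* - \ln Z(\eta^*) + \ln Z(\eta) \\
  &= E_{M_{\mu^*}}\bigl[\ln M_{\mu^*}(Z) - \ln M_{\mu}(Z)\bigr]
   = D(M_{\mu^*} \,\|\, M_{\mu}).
\end{align*}
```

The second line uses $E_{P_{\mu^*}}[X] = \mu^* = E_{M_{\mu^*}}[X]$. Applying this with $\mu = \hat\mu_i$, which depends only on $Z^i$ and is therefore independent of $Z_{i+1}$, and summing the per-outcome terms of (4) over $i = 0, \ldots, n-1$ gives the statement of the lemma.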

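Proof: (of Theorem 1) Choose any $\mu^* \in \Theta_{\text{mean}}$ and span around it a non-degenerate closed interval $\Theta'_{\mu^*} \subset \Theta_{\text{mean}}$, so that $\mu^* \in \mathrm{int}\,\Theta'_{\mu^*}$. Fix some $\epsilon > 0$. It follows from general properties of exponential families (see, e.g., [12]) that $\mathrm{var}_{M_\mu} X$ and $\mathrm{var}_{P_\mu} X$ are continuous (with respect to $\mu$); therefore, if we choose the interval $\Theta'_{\mu^*}$ small enough, we will have

$$\frac{\underline{V}_{\mathcal{P}}}{\overline{V}_{\mathcal{M}}} \geq \frac{\mathrm{var}_{P_{\mu^*}} X}{\mathrm{var}_{M_{\mu^*}} X} - \epsilon, \quad \text{with } \underline{V}_{\mathcal{P}} := \inf_{\mu \in \Theta'_{\mu^*}} \mathrm{var}_{P_\mu} X \text{ and } \overline{V}_{\mathcal{M}} := \sup_{\mu \in \Theta'_{\mu^*}} \mathrm{var}_{M_\mu} X.$$

Using Lemma 4 with $\Theta_0 = \Theta'_{\mu^*}$, and Lemma 5, we have for Lebesgue almost all $\mu \in \Theta'_{\mu^*}$:

$$\liminf_{n \to \infty} \frac{\mathcal{R}_U(n)}{\ln n} \geq \frac{1}{2}\frac{\underline{V}_{\mathcal{P}}}{\overline{V}_{\mathcal{M}}} \geq \frac{1}{2}\left(\frac{\mathrm{var}_{P_{\mu^*}} X}{\mathrm{var}_{M_{\mu^*}} X} - \epsilon\right).$$

Note that w.l.o.g. $\Theta'_{\mu^*}$ can be chosen to have rational ends. The family of all intervals $\Theta'_{\mu^*} \subset \Theta_{\text{mean}}$ with rational ends and rational $\mu^*$, i.e. $S = \{\Theta'_{\mu^*} = [\mu_0, \mu_1] \mid \mu^*, \mu_0, \mu_1 \in \Theta_{\text{mean}} \cap \mathbb{Q}\}$, is countable and covers $\Theta_{\text{mean}}$: $\bigcup_{\Theta' \in S} \Theta' = \Theta_{\text{mean}}$. Therefore, for Lebesgue almost all $\mu^* \in \Theta_{\text{mean}}$:

$$\liminf_{n \to \infty} \frac{\mathcal{R}_U(n)}{\frac{1}{2}\ln n} \geq \frac{\mathrm{var}_{P_{\mu^*}} X}{\mathrm{var}_{M_{\mu^*}} X} - \epsilon. \tag{13}$$

Since this holds for every $\epsilon > 0$, this also means that

$$\liminf_{n \to \infty} \frac{\mathcal{R}_U(n)}{\frac{1}{2}\ln n} \geq \frac{\mathrm{var}_{P_{\mu^*}} X}{\mathrm{var}_{M_{\mu^*}} X}$$

for Lebesgue almost all $\mu^* \in \Theta_{\text{mean}}$. To show this, assume the contrary, that the set

$$A = \left\{\mu^* : \liminf_{n \to \infty} \frac{\mathcal{R}_U(n)}{\frac{1}{2}\ln n} < \frac{\mathrm{var}_{P_{\mu^*}} X}{\mathrm{var}_{M_{\mu^*}} X}\right\}$$

has positive Lebesgue measure, $L(A) > 0$. Let $\epsilon_1, \epsilon_2, \ldots$ be any decreasing sequence of positive numbers converging to 0 and let us define

$$A_i = \left\{\mu^* : \liminf_{n \to \infty} \frac{\mathcal{R}_U(n)}{\frac{1}{2}\ln n} < \frac{\mathrm{var}_{P_{\mu^*}} X}{\mathrm{var}_{M_{\mu^*}} X} - \epsilon_i\right\}.$$

Obviously, $A_1 \subset A_2 \subset \ldots$ and $\bigcup_i A_i = A$. From continuity of measure, we must have $L(A_i) > 0$ for $i$ large enough, which is a contradiction with (13). The theorem is proved. ■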

Proof of Theorem 2

We will make use of the following two theorems, proofs of which can be found in [9].

Theorem 6: Let $X, X_1, \ldots$ be i.i.d., let $\hat\mu_n := (n_0 \cdot x_0 + \sum_{i=1}^n X_i)/(n + n_0)$ and $\mu^* = E[X]$. If the first $k$ moments of $X$ exist, then $E[(\hat\mu_n - \mu^*)^k] = O(n^{-\lceil k/2 \rceil})$.

Theorem 7: Let $X, X_1, \ldots$ be i.i.d. random variables, define $\hat\mu_n := (n_0 \cdot x_0 + \sum_{i=1}^n X_i)/(n + n_0)$ and $\mu^* = E[X]$. Let $k \in \{0, 2, 4, \ldots\}$. If the first $k$ moments exist then $P(|\hat\mu_n - \mu^*| \geq \delta) = O(n^{-\lceil k/2 \rceil} \delta^{-k})$.
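For $k = 2$ the rate in Theorem 6 can be made explicit, since the mean squared error of the modified ML estimator has a closed form (squared bias plus variance). The sketch below computes it and compares it with a small seeded simulation; the Gaussian sampling distribution, the function names and the number of trials are assumptions made for this illustration.

```python
import random

def mse_modified_ml(n, var_x, mu_star, x0, n0=1.0):
    """Exact E[(mu_hat_n - mu_star)^2] for mu_hat_n = (n0*x0 + sum X_i) / (n + n0)."""
    bias = n0 * (x0 - mu_star) / (n + n0)
    variance = n * var_x / (n + n0) ** 2
    return bias ** 2 + variance

def simulated_mse(n, var_x, mu_star, x0, n0=1.0, trials=4000, seed=0):
    """Monte Carlo estimate of the same quantity, with Gaussian outcomes."""
    rng = random.Random(seed)
    sd = var_x ** 0.5
    total = 0.0
    for _ in range(trials):
        s = sum(rng.gauss(mu_star, sd) for _ in range(n))
        mu_hat = (n0 * x0 + s) / (n + n0)
        total += (mu_hat - mu_star) ** 2
    return total / trials
```

Multiplying the exact value by $n$ gives a quantity that converges to the variance of $X$, which is the $O(n^{-1})$ behaviour that Theorem 6 states for $k = 2$.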

Before we prove the main theorem, we need the following lemma:

Lemma 8: Fix any $s \in \{0, 2, 4\}$. Let $f(\mu)$ be some continuous function of $\mu$. Suppose it holds for both $T := X$ and $T := -X$ that:

• If $T$ is unbounded from above then there is a $k \in \{4, 6, \ldots\}$ such that the first $k$ moments of $T$ exist under $P$ and that $f(\mu) = O(\mu^{k-s-2})$.

• If $T$ is bounded from above by a constant $g$ then $f(\mu)$ is polynomial in $1/(g - \mu)$.

Then the expression $E_P[f(\mu)(\hat\mu_i - \mu^*)^s]$, for $\mu$ between $\mu^*$ and $\hat\mu_i$, is of order $O(i^{-s/2})$.

Proof: The proof follows very closely part of the proof 
of Lemma 2 in [9]; we nevertheless give here a complete proof 
for the sake of clarity. 



Let us denote $\delta_i := \hat\mu_i - \mu^*$. We distinguish a number of regions in the value space of $\delta_i$: let $\Delta_- = (-\infty, 0)$ and let $\Delta_0 = [0, a)$ for some constant value $a > 0$. If the individual outcomes $X$ are bounded on the right hand side by a value $g$ then we require that $a < g$ and we define $\Delta_1 = [a, g)$; otherwise we define $\Delta_j = [a + j - 1, a + j)$ for $j \ge 1$. Now we want to analyze the asymptotic behavior of:

$$E_P\left[ f(\mu) \delta_i^s \right] = \sum_j P(\delta_i \in \Delta_j)\, E_P\left[ f(\mu) \delta_i^s \mid \delta_i \in \Delta_j \right].$$

If we can establish the proper asymptotic behavior $O(i^{-s/2})$ for all regions $\Delta_j$ for $j \ge 0$, then we can use a symmetrical argument to establish the behavior for $\Delta_-$ as well, so it suffices if we restrict ourselves to $j \ge 0$. First we show it for $\Delta_0$. In this case, the basic idea is that since the remainder $f(\mu)$ is well-defined over the interval $\mu^* \le \mu < \mu^* + a$, we can bound it by its extremum on that interval, namely $m := \sup_{\mu \in [\mu^*, \mu^* + a)} |f(\mu)|$. Now we get:

$$\left| P(\delta_i \in \Delta_0)\, E\left[ f(\mu) \delta_i^s \mid \delta_i \in \Delta_0 \right] \right| \le 1 \cdot E\left[ \delta_i^s |f(\mu)| \right],$$

which is less than or equal to $m E[\delta_i^s]$. Using Theorem 6 we find that $E[\delta_i^s]$ is $O(i^{-s/2})$, which is what we want. Theorem 6 requires that the first four moments of $P$ exist, but this is guaranteed to be the case: either the outcomes are bounded from both sides, in which case all moments necessarily exist, or the existence of the required moments is part of the condition on the main theorem.

Now we distinguish between the unbounded and bounded cases. First we assume $X$ is unbounded from above. In this case, we must show that:

$$\sum_{j \ge 1} P(\delta_i \in \Delta_j)\, E\left[ f(\mu) \delta_i^s \mid \delta_i \in \Delta_j \right] = O(i^{-s/2}). \tag{14}$$

We bound this expression from above. The $\delta_i$ in the expectation is at most $a + j$. Furthermore $f(\mu) = O(\mu^{k-s-2})$ by assumption, where $\mu \in [a + j - 1, a + j)$. Depending on $k$ and $s$, both boundaries could maximize this function, but it is easy to check that in both cases the resulting function is $O(j^{k-s-2})$. So we bound (14) from above by:

$$\sum_{j \ge 1} P(|\delta_i| \ge a + j - 1)\,(a + j)^s\, O(j^{k-s-2}).$$

Since we know from the condition on the main theorem that the first $k \ge 4$ moments exist, we can apply Theorem 7 to find that $P(|\delta_i| \ge a + j - 1) = O(i^{-k/2} (a + j - 1)^{-k}) = O(i^{-k/2})\, O(j^{-k})$ (since $k$ has to be even); plugging this into the equation and simplifying we obtain $\sum_j O(i^{-k/2})\, O(j^{-2})$, which is of order $O(i^{-s/2})$, since the sum $\sum_j O(j^{-2})$ converges and $k \ge s$.

Now we consider the case where the outcomes are bounded 
from above by g. This case is more complicated, since now 
we have made no extra assumptions as to existence of the 
moments of P. Of course, if the outcomes are bounded from 
both sides, then all moments necessarily exist, but if the 
outcomes are unbounded from below this may not be true. 



To remedy this, we map all outcomes into a new domain in such a way that all moments of the transformed variables are guaranteed to exist. Any constant $x^-$ defines a mapping $g(x) := \max\{x^-, x\}$. We define the random variables $Y_i := g(X_i)$, the initial outcome $y_0 := g(x_0)$ and the mapped analogues of $\mu^*$ and $\hat\mu_i$, respectively: $\mu^\dagger$ is defined as the mean of $Y$ under $P$ and $\tilde\mu_i := (y_0 \cdot n_0 + \sum_{j=1}^{i} Y_j)/(i + n_0)$. Since $\tilde\mu_i \ge \hat\mu_i$, we can bound:

$$\begin{aligned}
P(\delta_i \in \Delta_1) \left| E\left[ f(\mu) \delta_i^s \mid \delta_i \in \Delta_1 \right] \right|
&\le P(\hat\mu_i - \mu^* \ge a) \sup_{\delta_i \in \Delta_1} \left| f(\mu) \delta_i^s \right| \\
&\le P\left( |\tilde\mu_i - \mu^\dagger| \ge a + \mu^* - \mu^\dagger \right) g^s \sup_{\delta_i \in \Delta_1} |f(\mu)|.
\end{aligned}$$

By choosing $x^-$ small enough, we can bring $\mu^\dagger$ and $\mu^*$ arbitrarily close together; in particular we can choose $x^-$ such that $a + \mu^* - \mu^\dagger > 0$ so that application of Theorem 7 is safe. It reveals that the summed probability is $O(i^{-k/2})$ for any even $k \in \mathbb{N}$. Now we bound $f(\mu)$, which is $O((g - \mu)^{-m})$ for some $m \in \mathbb{N}$ by the condition on the main theorem. Here we use that $\mu \le \hat\mu_i$; the latter is maximized if all outcomes equal the bound $g$, in which case the estimator equals $g - n_0 (g - x_0)/(i + n_0) = g - O(i^{-1})$. Putting all of this together, we get $\sup |f(\mu)| = O((g - \mu)^{-m}) = O(i^m)$; if we plug this into the equation we obtain:

$$\ldots \le \sum_j O(i^{-k/2})\, g^s\, O(i^m) = g^s \sum_j O(i^{m - k/2}).$$

This is of order $O(i^{-s/2})$ if we choose $k \ge 6m + s$. We can do this because the construction of $g(\cdot)$ ensures that all moments exist, and therefore certainly the first $6m + s$ moments. ■
We can now proceed to prove the theorem: 

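Proof: (of Theorem 2) We express the relative redundancy of the slightly squashed ML plug-in code $\bar{U}$ as the sum of the relative redundancy of the ordinary ML plug-in code $U$ and the difference in expected codelengths between $\bar{U}$ and $U$:

$$\begin{aligned}
\mathcal{R}_{\bar{U}}(n) &= E_P\left[ L_{\bar{U}}(Z^n) \right] - E_P\left[ -\ln M_{\mu^*}(Z^n) \right] \\
&= E_P\left[ L_{\bar{U}}(Z^n) - L_U(Z^n) \right] + \mathcal{R}_U(n) \\
&= E_P\left[ L_{\bar{U}}(Z^n) - L_U(Z^n) \right] + \frac{1}{2}\,\frac{\mathrm{var}_P X}{\mathrm{var}_{M_{\mu^*}} X} \ln n + O(1),
\end{aligned}$$

where the last equality follows from the expansion of $\mathcal{R}_U(n)$ given earlier, which is valid under the conditions imposed on the derivative of $D(M_{\mu^*} \,\|\, M_\mu)$ with respect to $\mu$ (see Condition 1 in [9] for details). We have: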

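$$\begin{aligned}
L_{\bar{U}}(Z^n) - L_U(Z^n)
&= \sum_{i=0}^{n-1} \left( -\ln \bar{U}(Z_{i+1} \mid Z^i) + \ln U(Z_{i+1} \mid Z^i) \right) \\
&= \sum_{i=0}^{n-1} \left( \ln\left( 1 + \frac{1}{2i} \right) - \ln\left( 1 + \frac{1}{2i} I_M(\hat\mu_i)(X_{i+1} - \hat\mu_i)^2 \right) \right).
\end{aligned}$$

Since $\ln\left(1 + \frac{1}{2i}\right) = \frac{1}{2i} + O(i^{-2})$, we have:

$$\sum_{i=0}^{n-1} \ln\left( 1 + \frac{1}{2i} \right) = \frac{1}{2} \ln n + O(1). \tag{15}$$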
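The growth rate in (15) can be checked numerically. The following is a minimal sketch, assuming the sum is started at $i = 1$ (the $i = 0$ term of $\frac{1}{2i}$ is not defined as printed, so omitting it is an assumption made here purely for illustration); the function name is hypothetical.

```python
import math


def log_sum(n):
    """Return sum_{i=1}^{n-1} ln(1 + 1/(2i)).

    Assumption: the i = 0 term is omitted, since 1/(2i) is undefined there.
    """
    return sum(math.log1p(1.0 / (2 * i)) for i in range(1, n))


# The difference to (1/2) ln n should stay bounded as n grows.
for n in (10, 100, 1000, 10000, 100000):
    print(n, log_sum(n) - 0.5 * math.log(n))
```

Under this assumption the product $\prod_{i=1}^{n-1} (1 + \frac{1}{2i})$ equals $\Gamma(n + \frac{1}{2}) / (\Gamma(\frac{3}{2}) \Gamma(n))$, so the printed differences should settle near $\ln(2/\sqrt{\pi})$, which is one concrete value of the $O(1)$ term.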



To analyze the second term in the sum, we use the fact that for arbitrary $a \ge 0$:

$$-a \le -\ln(1 + a) \le -a + \frac{1}{2} a^2,$$

which follows e.g. from expanding the logarithm into a Taylor expansion up to the second order. In our case, $a = V_i := \frac{1}{2i} I_M(\hat\mu_i)(X_{i+1} - \hat\mu_i)^2$. We will show that $E_P[V_i^2]$ is $O(i^{-2})$, and then $E_P[-\ln(1 + V_i)] = -E_P[V_i] + O(i^{-2})$. We have:



$$\begin{aligned}
E_P[V_i^2] &= \frac{1}{4i^2} E_P\left[ I_M^2(\hat\mu_i)(X_{i+1} - \hat\mu_i)^4 \right] \\
&= \frac{1}{4i^2} E_P\left[ I_M^2(\hat\mu_i)\, E_{X_{i+1}}\left[ (X_{i+1} - \mu^* + \mu^* - \hat\mu_i)^4 \right] \right] \\
&= \frac{1}{4i^2} E_P\left[ I_M^2(\hat\mu_i) \left( m_P^{(4)} - 4 m_P^{(3)} \delta_i + 6\, \mathrm{var}_P X\, \delta_i^2 + \delta_i^4 \right) \right],
\end{aligned}$$

where $m_P^{(k)}$ is $E_P[(X - \mu^*)^k]$, the $k$-th central moment of $P$, and $\delta_i = \hat\mu_i - \mu^*$. We will show that the terms under expectation are bounded. If $I_M(\hat\mu_i)$ is constant, then we apply Theorem 6 with $k = 1$, $k = 2$ and $k = 4$ to the second, third and fourth term, respectively, and thus all the terms under expectation are $O(1)$. If $I_M(\hat\mu_i)$ is not constant, then by Condition 2 the assumptions of Lemma 8 are satisfied with $f(\mu) = I_M^2(\mu)$ and $s = 0, 2, 4$. Applying the lemma subsequently to the first, third and fourth term (with $s = 0, 2, 4$, respectively), we see that all those terms are $O(1)$. The second term is also $O(1)$ by applying Lemma 8 once again with $f(\mu) = \mu I_M^2(\mu)$ and $s = 0$ (assumptions are again satisfied by Condition 2). Thus, we showed that $E_P[V_i^2] = O(i^{-2})$.
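The elementary inequality $-a \le -\ln(1 + a) \le -a + \frac{1}{2} a^2$ for $a \ge 0$, which is what reduces the analysis of $E_P[-\ln(1 + V_i)]$ to the first two moments of $V_i$, can be sanity-checked on a grid. This is only an illustrative sketch; the function name, the grid range and the grid spacing are arbitrary choices, not part of the proof.

```python
import math


def bounds_hold(a):
    """Check -a <= -ln(1 + a) <= -a + a^2 / 2 for a single value a > -1."""
    middle = -math.log1p(a)
    return -a <= middle <= -a + 0.5 * a * a


# Grid of non-negative values 0, 0.001, ..., 5 (arbitrary illustration range).
grid = [k / 1000.0 for k in range(0, 5001)]
print(all(bounds_hold(a) for a in grid))

# For negative a the upper bound can fail, so a >= 0 matters here.
print(bounds_hold(-0.5))
```

The second call illustrates why non-negativity of $V_i$ is used: for $a = -0.5$ the upper bound no longer holds.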

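Next, we consider $E_P[V_i]$:

$$\begin{aligned}
E_P[V_i] &= \frac{1}{2i} E_P\left[ I_M(\hat\mu_i)(X_{i+1} - \mu^* + \mu^* - \hat\mu_i)^2 \right] \\
&= \frac{1}{2i} E_P\left[ I_M(\hat\mu_i)\left( \mathrm{var}_P X + \delta_i^2 \right) \right] \\
&= \frac{1}{2i} \left( \mathrm{var}_P X\, E_P\left[ I_M(\hat\mu_i) \right] + E_P\left[ I_M(\hat\mu_i) \delta_i^2 \right] \right).
\end{aligned}$$

The second term $E_P[I_M(\hat\mu_i) \delta_i^2]$ is $O(i^{-1})$ by Lemma 8 applied with $f(\mu) = I_M(\mu)$ and $s = 2$. To analyze the first term we expand $I_M(\hat\mu_i)$ into a Taylor series around $\mu^*$: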



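$$E_P\left[ I_M(\hat\mu_i) \right] = I_M(\mu^*) + E_P\left[ \frac{d}{d\mu} I_M(\mu^*)\, \delta_i + \frac{1}{2} \frac{d^2}{d\mu^2} I_M(\mu)\, \delta_i^2 \right]$$

for some $\mu$ between $\mu^*$ and $\hat\mu_i$. The linear term in the expansion is $O(i^{-1})$ by Theorem 6 applied with $k = 1$. The quadratic term is $O(i^{-1})$ by applying Lemma 8 with $f(\mu) = \frac{d^2}{d\mu^2} I_M(\mu)$ and $s = 2$; Condition 2 guarantees that the assumptions of the lemma are satisfied. Thus, using the identity $I_M(\mu^*) = 1/\mathrm{var}_{M_{\mu^*}} X$: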



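$$E_P[V_i] = \frac{1}{2i} I_M(\mu^*)\, \mathrm{var}_P X + O(i^{-2}) = \frac{1}{2i}\,\frac{\mathrm{var}_P X}{\mathrm{var}_{M_{\mu^*}} X} + O(i^{-2}),$$

so that:

$$E_P\left[ -\ln(1 + V_i) \right] = -\frac{1}{2i}\,\frac{\mathrm{var}_P X}{\mathrm{var}_{M_{\mu^*}} X} + O(i^{-2}). \tag{16}$$

Taking together (15) and (16) we have:

$$E_P\left[ L_{\bar{U}}(Z^n) - L_U(Z^n) \right] = \frac{1}{2} \ln n - \frac{1}{2}\,\frac{\mathrm{var}_P X}{\mathrm{var}_{M_{\mu^*}} X} \ln n + O(1),$$

and thus:

$$\mathcal{R}_{\bar{U}}(n) = \frac{1}{2} \ln n + O(1).$$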



